308 research outputs found
Exploiting the Statistics of Learning and Inference
When dealing with datasets containing a billion instances or with simulations
that require a supercomputer to execute, computational resources become part of
the equation. We can improve the efficiency of learning and inference by
exploiting their inherent statistical nature. We propose algorithms that
exploit the redundancy of data relative to a model by subsampling data-cases
for every update and reasoning about the uncertainty created in this process.
In the context of learning we propose to test for the probability that a
stochastically estimated gradient points more than 180 degrees in the wrong
direction. In the context of MCMC sampling we use stochastic gradients to
improve the efficiency of MCMC updates, and hypothesis tests based on adaptive
mini-batches to decide whether to accept or reject a proposed parameter update.
Finally, we argue that in the context of likelihood free MCMC one needs to
store all the information revealed by all simulations, for instance in a
Gaussian process. We conclude that Bayesian methods will remain to play a
crucial role in the era of big data and big simulations, but only if we
overcome a number of computational challenges.Comment: Proceedings of the NIPS workshop on "Probabilistic Models for Big
Data
Herding as a Learning System with Edge-of-Chaos Dynamics
Herding defines a deterministic dynamical system at the edge of chaos. It
generates a sequence of model states and parameters by alternating parameter
perturbations with state maximizations, where the sequence of states can be
interpreted as "samples" from an associated MRF model. Herding differs from
maximum likelihood estimation in that the sequence of parameters does not
converge to a fixed point and differs from an MCMC posterior sampling approach
in that the sequence of states is generated deterministically. Herding may be
interpreted as a"perturb and map" method where the parameter perturbations are
generated using a deterministic nonlinear dynamical system rather than randomly
from a Gumbel distribution. This chapter studies the distinct statistical
characteristics of the herding algorithm and shows that the fast convergence
rate of the controlled moments may be attributed to edge of chaos dynamics. The
herding algorithm can also be generalized to models with latent variables and
to a discriminative learning setting. The perceptron cycling theorem ensures
that the fast moment matching property is preserved in the more general
framework
Learning in Markov Random Fields with Contrastive Free Energies
Learning Markov random field (MRF) models is notoriously hard due to the presence of a global normalization factor. In this paper we present a new framework for learning MRF models based on the contrastive free energy (CF) objective function. In this scheme the parameters are updated in an attempt to match the average statistics of the data distribution and a distribution which is (partially or approximately) "relaxed" to the equilibrium distribution. We show that maximum likelihood, mean field, contrastive divergence and pseudo-likelihood objectives can be understood in this paradigm. Moreover, we propose and study a new learning algorithm: the "kstep Kikuchi/Bethe approximation". This algorithm is then tested on a conditional random field model with "skip-chain" edges to model long range interactions in text data. It is demonstrated that with no loss in accuracy, the training time is brought down on average from 19 hours (BP based learning) to 83 minutes, an order of magnitude improvement
Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference
We describe an embarrassingly parallel, anytime Monte Carlo method for
likelihood-free models. The algorithm starts with the view that the
stochasticity of the pseudo-samples generated by the simulator can be
controlled externally by a vector of random numbers u, in such a way that the
outcome, knowing u, is deterministic. For each instantiation of u we run an
optimization procedure to minimize the distance between summary statistics of
the simulator and the data. After reweighing these samples using the prior and
the Jacobian (accounting for the change of volume in transforming from the
space of summary statistics to the space of parameters) we show that this
weighted ensemble represents a Monte Carlo estimate of the posterior
distribution. The procedure can be run embarrassingly parallel (each node
handling one sample) and anytime (by allocating resources to the worst
performing sample). The procedure is validated on six experiments.Comment: NIPS 2015 camera read
Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior
In recent years a number of methods have been developed for automatically
learning the (sparse) connectivity structure of Markov Random Fields. These
methods are mostly based on L1-regularized optimization which has a number of
disadvantages such as the inability to assess model uncertainty and expensive
cross-validation to find the optimal regularization parameter. Moreover, the
model's predictive performance may degrade dramatically with a suboptimal value
of the regularization parameter (which is sometimes desirable to induce
sparseness). We propose a fully Bayesian approach based on a "spike and slab"
prior (similar to L0 regularization) that does not suffer from these
shortcomings. We develop an approximate MCMC method combining Langevin dynamics
and reversible jump MCMC to conduct inference in this model. Experiments show
that the proposed model learns a good combination of the structure and
parameter values without the need for separate hyper-parameter tuning.
Moreover, the model's predictive performance is much more robust than L1-based
methods with hyper-parameter settings that induce highly sparse model
structures.Comment: Accepted in the Conference on Uncertainty in Artificial Intelligence
(UAI), 201
GPS-ABC: Gaussian Process Surrogate Approximate Bayesian Computation
Scientists often express their understanding of the world through a
computationally demanding simulation program. Analyzing the posterior
distribution of the parameters given observations (the inverse problem) can be
extremely challenging. The Approximate Bayesian Computation (ABC) framework is
the standard statistical tool to handle these likelihood free problems, but
they require a very large number of simulations. In this work we develop two
new ABC sampling algorithms that significantly reduce the number of simulations
necessary for posterior inference. Both algorithms use confidence estimates for
the accept probability in the Metropolis Hastings step to adaptively choose the
number of necessary simulations. Our GPS-ABC algorithm stores the information
obtained from every simulation in a Gaussian process which acts as a surrogate
function for the simulated statistics. Experiments on a challenging realistic
biological problem illustrate the potential of these algorithms
- …